Abstraction is Harmful in Language Learning

Author

  • Walter Daelemans
Abstract

The usual approach to learning language processing tasks such as tagging, parsing, grapheme-to-phoneme conversion, pp-attachment, etc., is to extract regularities from training data in the form of decision trees, rules, probabilities, or other abstractions. These representations of regularities are then used to solve new cases of the task, while the individual training examples on which the abstractions were based are discarded (forgotten). While this approach seems to work well for other application areas of Machine Learning, I will show that there is evidence that it is not the best way to learn language processing tasks.

I will briefly review empirical work in our groups in Antwerp and Tilburg on lazy language learning. In this approach (also called instance-based, case-based, memory-based, and example-based learning), generalization happens at processing time by means of extrapolation from the most similar items in memory to the new item being processed. Lazy learning with a simple similarity metric based on information entropy (IB1-IG, Daelemans & van den Bosch, 1992, 1997) consistently outperforms abstracting (greedy) learning techniques such as C5.0 or backprop learning on a broad selection of natural language processing tasks ranging from phonology to semantics. Our intuitive explanation for this result is that lazy learning techniques keep all training items, whereas greedy approaches lose useful information by forgetting low-frequency or exceptional instances of the task not covered by the extracted rules or models (Daelemans, 1996). Apart from the empirical work in Tilburg and Antwerp, a number of recent studies on statistical natural language processing (e.g. Dagan & Lee, 1997; Collins & Brooks, 1995) also suggest that, contrary to common wisdom, forgetting specific training items, even when they represent extremely low-frequency events, is harmful to generalization accuracy.

After reviewing this empirical work briefly, I will report on new results (work in progress in collaboration with van den Bosch and Zavrel), systematically comparing greedy and lazy learning techniques on a number of benchmark natural language processing tasks: tagging, grapheme-to-phoneme conversion, and pp-attachment. The results show that forgetting individual training items, however "improbable" they may be, is indeed harmful. Furthermore, they show that combining lazy learning with training set editing techniques (based on typicality and other regularity criteria) also leads to worse generalization results. I will conclude that forgetting, either by abstracting from the training data or by editing exceptional training items in lazy learning, is harmful to generalization accuracy, and will attempt to provide an explanation for these unexpected results.
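To make the contrast concrete, below is a minimal Python sketch of the kind of lazy learner described above: 1-nearest-neighbour classification over symbolic feature vectors, with each feature weighted by its information gain, in the spirit of the IB1-IG metric. This is an illustrative reconstruction under stated assumptions, not the authors' implementation; the function names and the data layout are my own.

```python
import math
from collections import Counter, defaultdict

def information_gain(examples, feature_index):
    """Information gain of one symbolic feature with respect to the
    class labels; `examples` is a list of (feature_tuple, label) pairs."""
    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    all_labels = [label for _, label in examples]
    # Group the class labels by the value this feature takes.
    partitions = defaultdict(list)
    for features, label in examples:
        partitions[features[feature_index]].append(label)
    # IG = class entropy minus the entropy remaining after the split.
    remainder = sum((len(part) / len(examples)) * entropy(part)
                    for part in partitions.values())
    return entropy(all_labels) - remainder

def ib1_ig_classify(examples, query):
    """Classify `query` by its nearest stored instance under the
    information-gain-weighted overlap metric:
    distance(X, Y) = sum_i w_i * [x_i != y_i]."""
    weights = [information_gain(examples, i) for i in range(len(query))]

    def distance(features):
        return sum(w for w, x, q in zip(weights, features, query) if x != q)

    # Lazy learning: every training instance stays in memory, and even a
    # one-off exception can decide the outcome if it is the closest match.
    _, label = min(examples, key=lambda example: distance(example[0]))
    return label
```

Because nothing is abstracted away, memory grows with the training set, but no low-frequency event is ever lost: a single stored exception can decide a classification if it happens to be the closest match. The training-set editing experiments mentioned in the abstract amount to deleting entries from `examples` before classification, which is precisely the kind of forgetting the results argue against.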


Similar articles

Forgetting Exceptions is Harmful

We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first ser...


EFL Learning, EFL Motivation Types, and National Identity: In Conflict or in Coalition

The present study was aimed at examining concerns about the social effects of EFL learning, a challenging area of research that has not been discussed sufficiently. It tried to investigate the relationship between EFL learning and national identity. In addition, an attempt was made to find a relationship between language motivation types and national identity. Furthermore, the role of two demogra...


Instance-Family Abstraction in Memory-Based Language Learning

Memory-based learning appears relatively successful when the learning data is highly disjunct, i.e., when classes are scattered over many small families of instances in instance space, as in many language learning tasks. Abstraction over borders of disjuncts tends to harm generalization performance. However, careful abstraction in memory-based learning may be harmless when it preserves the dis...


Applied Linguistic Approach to Language Learning Strategies (A Critical Review)

From an applied linguistics point of view, the fundamental question facing language teachers, methodologists, and course designers is which procedure is more effective in FL/SL: learning to use, or using to learn? Certainly, in order to be a competent language user, knowledge of the language system is necessary, but it is not sufficient to be a successful language user. That is why there was a gradu...


Careful abstraction from instance families in memory-based language learning

Empirical studies in inductive language learning point to pure memory-based learning as a successful approach to many language learning tasks, often performing better than learning methods that abstract from the learning material. The possibility is left open, however, that limited, careful abstraction in memory-based learning may be harmless to generalisation, as long as the disjunctivity of la...




Journal:

Volume   Issue

Pages  -

Publication date: 1998